ABSTRACT

This study presents a divide-and-conquer (DC) approach based on feature-space decomposition for classification. When large-scale datasets are present, typical approaches employ truncated kernel methods on the feature space or DC approaches on the sample space. However, these approaches do not guarantee separability between classes, owing to overfitting. To overcome such problems, this work proposes a novel DC approach on feature spaces consisting of three steps. First, we divide the feature space into several subspaces using the decomposition method proposed in this paper. Subsequently, these feature subspaces are sent to individual local classifiers for training. Finally, the outcomes of the local classifiers are fused to generate the final classification results. Experiments on large-scale datasets are carried out for performance evaluation. The results show that the proposed DC method achieves lower error rates than state-of-the-art fast SVM solvers, e.g., reducing error rates by 10.53% and 7.53% on the RCV1 and covtype datasets, respectively.

Index Terms — Feature space decomposition, feature 
space division, fusion, divide-and-conquer, classification 

1. INTRODUCTION 

Typical kernel-based classifiers, such as Support Vector Machines (SVMs) and Kernel Ridge Regression (KRR), usually employ Radial Basis Functions (RBFs) as the kernel, for RBFs can effectively delineate the distribution of the data by using mixtures of Gaussian models. Furthermore, RBFs map the input features into an intrinsic space that is spanned by infinite-dimensional vectors. This correspondingly increases the opportunity of creating a discriminant hyperplane in the empirical space, subsequently enhancing discriminability. However, when the input dimensions are sufficiently large, calculating the kernel matrix becomes a burden. Moreover, RBFs may lead to overfitting due to the infinite dimensions. To deal with such problems, rather than using conventional RBFs, Wu et al. proposed using Truncated Radial Basis Functions (TRBFs) to avoid generating infinite dimensions in the intrinsic space. Furthermore, they devised an intrinsic data matrix, derived from a finite-decomposable kernel, to replace the calculation of kernel matrices in the empirical space. Therefore, the time complexity was reduced from the original $O(N^3)$ to $\min\{N^3, J^2 N + J^3\}$ for KRR, where $N$ is the number of instances and $J$ is the number of feature dimensions expanded by TRBFs. Moreover, avoiding direct calculation of kernel matrices effectively resolved the need for matrix expansion.

The success of the TRBF-based method relies on dimensionality reduction in the intrinsic space and on the conversion from the empirical space to the intrinsic space. Although the computational load is relieved without losing too much accuracy, that method did not improve discriminability and separability between features. Furthermore, its algorithmic architecture did not support distributed processing, even though mainstream toolboxes like Apache Hadoop (hadoop.apache.org) and Spark (spark.apache.org) adopt a divide-and-conquer strategy in their implementations. Proposing a new architecture that supports divide-and-conquer computation correspondingly becomes necessary.

In response to such a demand, several divide-and-conquer classifiers based on kernel tricks have been developed so far. Zhang et al. used divide-and-conquer KRR to support computation on large-scale data. First, their method randomly partitioned a dataset into subsets of equal size. Local solutions were subsequently computed by using KRR on each subset. By averaging the local solutions, a global predictor was obtained. Instead of using randomized data selection as Zhang et al. did, Hsieh et al. focused on systematic data division before applying divide-and-conquer classifiers to the data. In their approach, kernel K-means clustering was performed to select the representatives of the entire input data. Next, the members of each subset were selected based on one representative. Their experimental results showed favorable accuracy when systematic data division was used.
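The random-partition scheme attributed to Zhang et al. above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not their implementation: the inputs are scalars, the kernel is an RBF with a hypothetical `gamma`, the ridge value is arbitrary, and the helper names (`fit_dc_krr`, `predict_dc_krr`, `solve`) are invented for this example.

```python
import math
import random


def rbf(a, b, gamma):
    """RBF kernel between two scalar inputs."""
    return math.exp(-gamma * (a - b) ** 2)


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def fit_krr(xs, ys, gamma, ridge):
    """Local KRR: solve (K + ridge * I) alpha = y on one subset."""
    n = len(xs)
    K = [[rbf(xs[i], xs[j], gamma) + (ridge if i == j else 0.0)
          for j in range(n)] for i in range(n)]
    return xs, solve(K, ys)


def predict_krr(model, x, gamma):
    xs, alpha = model
    return sum(a * rbf(xi, x, gamma) for xi, a in zip(xs, alpha))


def fit_dc_krr(xs, ys, n_subsets, gamma, ridge, seed=0):
    """Randomly partition the data into equal-size subsets; fit one KRR each."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    size = len(idx) // n_subsets
    models = []
    for k in range(n_subsets):
        part = idx[k * size:(k + 1) * size]
        models.append(fit_krr([xs[i] for i in part],
                              [ys[i] for i in part], gamma, ridge))
    return models


def predict_dc_krr(models, x, gamma):
    """Global predictor: the average of the local predictions."""
    return sum(predict_krr(m, x, gamma) for m in models) / len(models)


# Toy regression target (an assumption for illustration): y = sin(x) on [0, 3].
xs = [3.0 * i / 59 for i in range(60)]
ys = [math.sin(x) for x in xs]
models = fit_dc_krr(xs, ys, n_subsets=3, gamma=1.0, ridge=1e-3)
estimate = predict_dc_krr(models, 1.5, gamma=1.0)
```

Each local solve works on a subset of size N/3, so the cubic cost of the linear system applies to the subset size rather than to N.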

Although the above-mentioned approaches realized the divide-and-conquer concept in their algorithms, overfitting in the kernel space was not fully addressed and resolved. To deal with the aforementioned problems, this study proposes

1) A novel approach for feature-space decomposition, where the original feature space is converted to subspaces. In addition, the bases of each subspace are reranked according to their importance.

2) A divide-and-conquer structure that allows independent local classifiers to create discriminant hyperplanes based on subspaces rather than the entire empirical space. This lowers computational complexity while avoiding overfitting problems.

The rest of this paper is organized as follows. Section 2 introduces an overview of the proposed method. Section 3 then describes the details of the proposed feature-space decomposition and fusion methods. Next, Section 4 summarizes the performance of the proposed method and the analysis results. Conclusions are finally drawn in Section 5.


2. SYSTEM OVERVIEW 


Given an $M \times N$ data matrix $X$ with $N$ instances and $M$ features and a $1 \times N$ label vector $y$, denote the feature space as $\Omega$, so that $X$ is the projection of the $N$ instances on $\Omega$. We first define the feature-space decomposition method $D = \{T, I\}$, where $T$ is a feature-space transform function and $I$ is a set of feature index groups.

The decomposition method $D$ contains five sub-methods, which are discussed in Section 3.1, namely Random Decomposition (RD), Principal Component Analysis (PCA), Discriminant Component Analysis (DCA), Block Cholesky Decomposition (BCD), and Approximate Block Decomposition (ABD). Each has an $M \times M$ sub-transform matrix, denoted as $W_{RD}$, $W_{PCA}$, $W_{DCA}$, $W_{BCD}$, and $W_{ABD}$, respectively. Each also contains a subset of feature index groups, e.g.,

$$I^{RD} = \{\, I_i^{RD} \mid I_i^{RD} \subset \{1, 2, \ldots, M\},\ i = 1, 2, \ldots, h^{RD} \,\},$$

where $h^{RD}$ is the number of feature subspaces decomposed by the RD sub-method. As for $T$, we have

$$\Omega^* = T(\Omega), \qquad x^* = T(x) = Wx, \qquad W = \begin{pmatrix} W_{RD} \\ W_{PCA} \\ W_{DCA} \\ W_{BCD} \\ W_{ABD} \end{pmatrix} \tag{1}$$

The results $R_i$ of the local classifiers $f_i$ on $X_i^*$ are stacked as $R = [R_1^T, R_2^T, \ldots, R_h^T]^T$, whose entries can be discrete labels or continuous prediction values. The system generates the output based on $R$ using the fusion method discussed in Section 3.2.
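The overall flow of this section (split the transformed features by index groups, train one local classifier per subspace, stack the local results into R, then fuse) can be sketched as follows. This is only an illustration under assumptions: the transform is taken as the identity, labels are ±1, instances are stored row-wise rather than as columns of X, and a nearest-class-mean scorer (`CentroidScorer`, a hypothetical stand-in) replaces the SVM or KRR classifiers named in the text.

```python
def split_features(instance, index_groups):
    """Decompose one (already transformed) instance into feature subspaces."""
    return [[instance[j] for j in group] for group in index_groups]


class CentroidScorer:
    """Stand-in classifier: gap between squared distances to the class means."""

    def fit(self, rows, labels):
        self.means = {}
        for label in (-1, 1):
            members = [r for r, y in zip(rows, labels) if y == label]
            self.means[label] = [sum(col) / len(members) for col in zip(*members)]
        return self

    def score(self, row):
        d = {label: sum((a - b) ** 2 for a, b in zip(row, mean))
             for label, mean in self.means.items()}
        return d[-1] - d[1]  # positive when the row is closer to class +1


def fit_pipeline(instances, labels, index_groups):
    """Train one local classifier per subspace, then a global one on R."""
    parts = [split_features(x, index_groups) for x in instances]
    local = [CentroidScorer().fit([p[i] for p in parts], labels)
             for i in range(len(index_groups))]
    # R holds the local scores of every instance (one row per instance here).
    R = [[f.score(p[i]) for i, f in enumerate(local)] for p in parts]
    fused = CentroidScorer().fit(R, labels)
    return local, fused


def predict(model, instance, index_groups):
    local, fused = model
    parts = split_features(instance, index_groups)
    r = [f.score(part) for f, part in zip(local, parts)]
    return 1 if fused.score(r) >= 0 else -1


# Toy data: four features, split into two subspaces of two features each.
X = [[0.0, 0.1, 0.2, 0.0], [0.2, 0.0, 0.1, 0.1],
     [1.0, 0.9, 1.1, 1.0], [0.9, 1.1, 1.0, 0.8]]
y = [-1, -1, 1, 1]
groups = [[0, 1], [2, 3]]
model = fit_pipeline(X, y, groups)
predictions = [predict(model, x, groups) for x in X]
```

The local classifiers never see features outside their own index group, which is what allows them to be trained independently.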

3. PROPOSED DIVISION AND FUSION METHODS 
3.1. Feature-Space Decomposition 

Section 2 shows that the merit of the proposed method is that it can perform classification within the subspaces while ignoring the dependence among subspaces. Ideally, the decomposition method should therefore reduce the dependence between any two feature subspaces as much as possible while retaining the dependence within each subspace. This is the reason for conducting a transformation on the feature space before division.

Among all the sub-methods in this study, the simplest is RD, which directly decomposes the feature space based on $I$. Its $W_{RD}$ is an $M \times M$ identity matrix.
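As a small illustration of RD, the index groups can be drawn by shuffling the feature indices. The sketch below assumes disjoint groups of nearly equal size and uses 0-based indices, whereas the text uses $\{1, \ldots, M\}$ and the settings in Table 2 suggest that groups may also overlap; the function name `random_decomposition` is hypothetical.

```python
import random


def random_decomposition(num_features, num_subspaces, seed=0):
    """Return index groups that randomly partition {0, ..., M-1}."""
    indices = list(range(num_features))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices round-robin so group sizes differ by at most one.
    return [sorted(indices[k::num_subspaces]) for k in range(num_subspaces)]


groups = random_decomposition(num_features=10, num_subspaces=3)
```

Because the RD transform is the identity, no matrix product is needed; selecting the indexed rows of X is the whole decomposition.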

As for PCA, we conduct PCA on the data matrix $X$ and split up the features according to $I$. Since PCA diagonalizes the feature covariance matrix $S$, this method eliminates the correlation between different features, both among and within subspaces. If the data obey a Gaussian distribution, PCA also eliminates the dependence between features among and within feature subspaces.
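The decorrelating effect of PCA can be checked on a two-feature toy example, where the orthogonal transform is a plane rotation with a closed-form angle. The restriction to two features and the sample values are assumptions made only to keep the sketch free of an eigen-solver.

```python
import math


def covariance_2d(xs, ys):
    """Covariance entries (s_xx, s_xy, s_yy) of two features."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    s_xx = sum((x - mx) ** 2 for x in xs) / n
    s_yy = sum((y - my) ** 2 for y in ys) / n
    s_xy = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n
    return s_xx, s_xy, s_yy


def pca_rotation_2d(xs, ys):
    """Angle of the rotation that diagonalizes the 2x2 covariance matrix."""
    s_xx, s_xy, s_yy = covariance_2d(xs, ys)
    return 0.5 * math.atan2(2.0 * s_xy, s_xx - s_yy)


def rotate(xs, ys, theta):
    """Apply the orthogonal transform [[c, s], [-s, c]] to every instance."""
    c, s = math.cos(theta), math.sin(theta)
    us = [c * x + s * y for x, y in zip(xs, ys)]
    vs = [-s * x + c * y for x, y in zip(xs, ys)]
    return us, vs


xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [1.2, 1.9, 3.2, 3.8, 5.1]
theta = pca_rotation_2d(xs, ys)
us, vs = rotate(xs, ys, theta)
_, s_uv_before, _ = covariance_2d(xs, ys)
_, s_uv_after, _ = covariance_2d(us, vs)  # numerically zero after the rotation
```

The off-diagonal covariance vanishes after the rotation, so any split of the rotated features yields mutually uncorrelated subspaces.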

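DCA also conducts an orthogonal transformation like PCA, while its discriminant matrix is $[S_w + \rho I]^{-1} S$, where $S_w$ is the within-class scatter matrix and $\rho$ is the ridge parameter. We have

$$S_w = \sum_{l=1}^{L} \sum_{j=1}^{N_l} \left(x_j^{(l)} - \mu^{(l)}\right)\left(x_j^{(l)} - \mu^{(l)}\right)^T \tag{2}$$

where $L$ is the number of classes, $N_l$ represents the number of samples in class $l$, $x_j^{(l)}$ is the $j$th sample of class $l$, and $\mu^{(l)}$ specifies the average point of class $l$. We conduct generalized eigenvector decomposition to obtain the eigenvectors $v_1, v_2, \ldots, v_M$ and the eigenvalues $\lambda_1, \lambda_2, \ldots, \lambda_M$, such that

$$S v_k = \lambda_k [S_w + \rho I] v_k, \qquad k = 1, 2, \ldots, M \tag{3}$$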




In Equation (1), $W$ and $\Omega^*$ are respectively the transform matrix and the new feature space. As for $I$, we have $I = \{I^{RD}, I^{PCA}, I^{DCA}, I^{BCD}, I^{ABD}\}$, and the total number of subspaces is $h = h^{RD} + h^{PCA} + h^{DCA} + h^{BCD} + h^{ABD}$. Not all the sub-methods need to be used in practice. If some are not applied, the corresponding $W_{Method}$ and $I^{Method}$ can simply be empty.

The original feature space $\Omega$ is first transformed to $\Omega^*$ by $T$ and then decomposed into subspaces $\Omega_1^*, \ldots, \Omega_h^*$ by $I$; accordingly, all the instances are first projected to $X^*$ and subsequently decomposed into $X_1^*, X_2^*, \ldots, X_h^*$. Then, a local classifier $f_i$ ($i = 1, 2, \ldots, h$), e.g., an SVM or KRR, is trained using the data matrix $X_i^*$. Let the row vector $R_i = f_i(X_i^*)$ denote the results of $f_i$ on $X_i^*$.


The DCA transform matrix is defined as $W_{DCA} = [v_1, v_2, \ldots, v_M]$. Computing $S$ and $S_w$ each takes $O(M^2 N)$ time. As $[S_w + \rho I]$ can hardly be singular, the complexity of the generalized eigenvalue decomposition equals that of $[S_w + \rho I]^{-1} S v_k = \lambda_k v_k$, $k = 1, 2, \ldots, M$, which is $O(M^3)$. Therefore, the total complexity of DCA is $O(2M^2 N + M^3)$.
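The two matrices that DCA needs can be accumulated directly from the data, which also makes the $O(M^2 N)$ cost visible: every instance contributes one $M \times M$ outer product. The sketch assumes the standard definition of the within-class scatter matrix and takes $S$ as the unnormalized, center-adjusted scatter of all data; instances are stored row-wise, and the toy values and the ridge `rho=0.5` are arbitrary.

```python
def mean_vector(rows):
    return [sum(col) / len(rows) for col in zip(*rows)]


def scatter(rows, center):
    """Sum of outer products (x - center)(x - center)^T over the rows."""
    m = len(center)
    S = [[0.0] * m for _ in range(m)]
    for x in rows:
        d = [a - b for a, b in zip(x, center)]
        for i in range(m):
            for j in range(m):
                S[i][j] += d[i] * d[j]
    return S


def within_class_scatter(instances, labels):
    """S_w: scatter of each class around its own mean, summed over classes."""
    m = len(instances[0])
    S_w = [[0.0] * m for _ in range(m)]
    for label in sorted(set(labels)):
        members = [x for x, y in zip(instances, labels) if y == label]
        S_l = scatter(members, mean_vector(members))
        S_w = [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(S_w, S_l)]
    return S_w


def add_ridge(S, rho):
    """Return S + rho * I, the regularized matrix on the right of Equation (3)."""
    return [[v + (rho if i == j else 0.0) for j, v in enumerate(row)]
            for i, row in enumerate(S)]


X = [[0.0, 0.0], [2.0, 0.0], [4.0, 4.0], [6.0, 4.0]]
y = [0, 0, 1, 1]
S = scatter(X, mean_vector(X))        # [[20, 16], [16, 16]]
S_w = within_class_scatter(X, y)      # [[4, 0], [0, 0]]
S_w_ridge = add_ridge(S_w, rho=0.5)   # [[4.5, 0], [0, 0.5]]
```

In this toy case $S_w$ alone is singular (its second diagonal entry is zero), and the ridge term makes it invertible, which is the situation the remark about singularity refers to.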

BCD exploits a blocked Doolittle algorithm, which is a form of Gaussian transformation rather than an orthogonal transformation, to eliminate the relevance among subspaces while retaining the relevance within subspaces. For a symmetric block matrix $A$, we eliminate the first row and column of blocks, as shown in Equation (4).


Table 1. Time complexity of different transformation methods.

$$B_1 A B_1^T = \begin{pmatrix} A_{11} & 0 \\ 0 & A' \end{pmatrix}, \qquad B_1 = \begin{pmatrix} I_1 & 0_{12} & \cdots & 0_{1k} \\ -A_{21}A_{11}^{-1} & I_2 & \cdots & 0_{2k} \\ \vdots & \vdots & \ddots & \vdots \\ -A_{k1}A_{11}^{-1} & 0_{k2} & \cdots & I_k \end{pmatrix} \tag{4}$$

where $I_j$ and $0_{ij}$ denote identity and zero blocks, and the division of blocks remains the same. The subscript of $B$ indicates the row and column it eliminates. Iteratively, we generate $B_2, \ldots, B_k$ to sequentially eliminate the remaining rows and columns of blocks. The main goal of BCD is to sequentially block-diagonalize the discriminant matrix based on the blocked Doolittle algorithm. As shown in Algorithm 1, $X$ is first rearranged to generate $\bar{X}$ according to $I^{BCD}$, in which MatrixSplit($X$, $I^{BCD}$) splits $X$ into $X_1, X_2, \ldots, X_{h^{BCD}}$ according to $I^{BCD}$. The discriminant matrix of BCD is $S$. The function BlockedDoolittle($S$, $i$) generates $B_i$ ($i = 1, 2, \ldots, h^{BCD}$) based on the idea of Equation (4) to eliminate the $i$th row and column of $S$. The BCD transform matrix is $W_{BCD} = B_{h^{BCD}} B_{h^{BCD}-1} \cdots B_1$. Compared with BCD, PCA needs an $M \times M$ matrix inversion, whose complexity is $O(M^3)$ on a non-sparse matrix, whereas BCD only inverts an $\frac{M}{h^{BCD}} \times \frac{M}{h^{BCD}}$ matrix $h^{BCD}$ times if the features are divided equally, which costs only $\frac{1}{(h^{BCD})^2}$ of the time of PCA.


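Algorithm 1  $[X^*, W_{BCD}] = \mathrm{BCD}(X, I)$, with $m = h^{BCD}$ blocks

  $\{X_1, X_2, \ldots, X_m\} = \mathrm{MatrixSplit}(X, I)$
  $\bar{X} = [X_1^T, X_2^T, \ldots, X_m^T]^T$
  $S = \bar{X}\bar{X}^T = \begin{pmatrix} S_{11} & \cdots & S_{1m} \\ \vdots & \ddots & \vdots \\ S_{m1} & \cdots & S_{mm} \end{pmatrix}$
  for $i = 1$ to $m$ do
    $B_i = \mathrm{BlockedDoolittle}(S, i)$
    $S = B_i S B_i^T$
  end for
  $W_{BCD} = B_m B_{m-1} \cdots B_1$
  $X^* = W_{BCD} \bar{X}$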
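The loop of Algorithm 1 can be sketched with exact rational arithmetic so that the eliminated blocks come out as exact zeros. This is an illustrative reading of the blocked Doolittle step, not the authors' code: blocks are addressed through index groups instead of physically rearranging the matrix, the helper names are hypothetical, and the 4x4 matrix is an arbitrary symmetric positive-definite example.

```python
from fractions import Fraction


def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def transpose(A):
    return [list(col) for col in zip(*A)]


def inverse(A):
    """Gauss-Jordan inverse with exact Fraction arithmetic."""
    n = len(A)
    M = [[Fraction(v) for v in row] + [Fraction(int(i == j)) for j in range(n)]
         for i, row in enumerate(A)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if M[r][col] != 0)
        M[col], M[pivot] = M[pivot], M[col]
        M[col] = [v / M[col][col] for v in M[col]]
        for r in range(n):
            if r != col:
                M[r] = [a - M[r][col] * b for a, b in zip(M[r], M[col])]
    return [row[n:] for row in M]


def blocked_doolittle(S, groups, i):
    """Build B_i, which clears the blocks below diagonal block i of S."""
    n = len(S)
    B = [[Fraction(int(r == c)) for c in range(n)] for r in range(n)]
    pivot = groups[i]
    S_ii_inv = inverse([[S[r][c] for c in pivot] for r in pivot])
    for k in range(i + 1, len(groups)):
        S_ki = [[S[r][c] for c in pivot] for r in groups[k]]
        block = matmul(S_ki, S_ii_inv)  # S_ki S_ii^{-1}
        for a, r in enumerate(groups[k]):
            for b, c in enumerate(pivot):
                B[r][c] = -block[a][b]
    return B


def bcd_transform(S, groups):
    """Return (W, block-diagonalized S) with W = B_m ... B_1."""
    n = len(S)
    W = [[Fraction(int(r == c)) for c in range(n)] for r in range(n)]
    S = [[Fraction(v) for v in row] for row in S]
    for i in range(len(groups)):
        B = blocked_doolittle(S, groups, i)
        S = matmul(matmul(B, S), transpose(B))  # S <- B_i S B_i^T
        W = matmul(B, W)
    return W, S


# Symmetric positive-definite 4x4 matrix split into two 2x2 blocks.
S = [[4, 1, 2, 0],
     [1, 3, 0, 1],
     [2, 0, 5, 1],
     [0, 1, 1, 4]]
groups = [[0, 1], [2, 3]]
W, D = bcd_transform(S, groups)
off_diagonal = [D[r][c] for r in groups[0] for c in groups[1]]
```

After the loop, the off-diagonal blocks of `D` are zero while the first diagonal block is left unchanged, i.e. the relevance among subspaces is removed and the relevance within them is kept.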


Besides the aforementioned orthogonal transformations of PCA and DCA, as well as the Gaussian transformation of BCD, we also propose an approximate orthogonal transformation, on which the ABD method is based. First, we define a new operator $\otimes$ in Definition 1.

Definition 1. For two block matrices $A = \{A_{ij}\}$ and $B = \{B_{ij}\}$ with the same size and division of blocks, define the operator $\otimes$ such that

$$A \otimes B = \begin{pmatrix} x_{11} & \cdots & x_{1n} \\ \vdots & \ddots & \vdots \\ x_{m1} & \cdots & x_{mn} \end{pmatrix} \tag{5}$$


T      Complexity                        Detail
RD     $O(N)$                            Unsupervised, identity transform
PCA    $O(M^2 N + M^3)$                  Unsupervised, orthogonal transform (OT)
DCA    $O(2M^2 N + M^3)$                 Supervised, OT
BCD    $O(M^2 N + M^3/(h^{BCD})^2)$      Unsupervised, Gaussian transform
ABD    $O(MmN + m^3)$                    Unsupervised, approximate OT


where $x_{ij}$ equals the sum of all the elements of the element-wise product of $A_{ij}$ and $B_{ij}$.

We rewrite $X$ as

$$X = \begin{pmatrix} X_{11} & \cdots & X_{1N} \\ \vdots & \ddots & \vdots \\ X_{h^{ABD}1} & \cdots & X_{h^{ABD}N} \end{pmatrix},$$

where we divide each instance into $m$ vectors according to $I$. The discriminant matrix is $X \otimes X$ using this division. By conducting eigenvector decomposition on the discriminant matrix, we have

$$X \otimes X = V^T \Lambda V \tag{6}$$

where $V = \{v_{ij}\}$, and each column of $V$ is an eigenvector. The transform matrix is

$$W_{ABD} = \cdots \tag{7}$$

If there are approximately equal numbers of features in each subset, computing $X \otimes X$ takes $O(MmN)$ time and the eigenvalue decomposition costs $O(m^3)$. Therefore, the total complexity of ABD is $O(MmN + m^3)$.
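Read literally, the operator of Definition 1 collapses every pair of corresponding blocks into a single number. The sketch below follows that literal reading; the function name `block_inner_product`, the use of index groups to describe the block division, and the small integer matrices are assumptions made for illustration.

```python
def block(A, rows, cols):
    """Extract the block of A given by the row and column index lists."""
    return [[A[r][c] for c in cols] for r in rows]


def block_inner_product(A, B, row_groups, col_groups):
    """x_ij = sum of the element-wise product of blocks A_ij and B_ij."""
    result = []
    for rows in row_groups:
        out_row = []
        for cols in col_groups:
            A_ij, B_ij = block(A, rows, cols), block(B, rows, cols)
            out_row.append(sum(a * b for ra, rb in zip(A_ij, B_ij)
                               for a, b in zip(ra, rb)))
        result.append(out_row)
    return result


A = [[1, 2, 0],
     [3, 4, 1],
     [0, 1, 2],
     [2, 0, 1]]
B = [[1, 0, 2],
     [0, 1, 1],
     [3, 1, 0],
     [1, 2, 2]]
row_groups = [[0, 1], [2, 3]]  # two block rows of two rows each
col_groups = [[0], [1, 2]]     # two block columns
X = block_inner_product(A, B, row_groups, col_groups)  # [[1, 5], [2, 3]]
```

The result has one entry per block, so its size depends on the number of blocks rather than on the number of features.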

Table 1 shows the time complexity and details of the aforementioned sub-methods. By combining the five methods, $D$ includes both supervised (i.e., DCA) and unsupervised (i.e., RD, PCA, BCD, and ABD) transformations, as well as four types of transformation (identity, orthogonal, Gaussian, and approximate orthogonal).

3.2. Feature Subspace Fusion 

After obtaining the classification result matrix $R$ from the local classifiers, we weight the outcome of each subspace by training a global classifier $f_{n+1}$ using $R$ as the data matrix and $y$ as the labels. The output of $f_{n+1}$ is the final prediction result. Observations show that $m < 50 \ll N$ and that TRBFKRR generates favorable results for $f_{n+1}$. The training complexity of TRBFKRR is $\min\{N^3, J^2 N + J^3\}$, where $J = \binom{m+p}{p}$ and $p$ is the TRBF order, so it is efficient to train on a data matrix with a large number of instances and a small number of features, like $R$.
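To see why a small $m$ keeps the fusion step cheap, the cost expression can be evaluated numerically. The values below are illustrative: $m = 20$ matches the total number of subspaces listed for covtype in Table 2, $p = 2$ matches a second-order TRBF, and $N$ is the covtype training size from Table 3; constants hidden by the complexity notation are ignored.

```python
import math


def intrinsic_degree(m, p):
    """J = C(m + p, p): number of TRBF basis functions for m features, order p."""
    return math.comb(m + p, p)


def trbf_krr_cost(n, m, p):
    """min{N^3, J^2 N + J^3}, counted in abstract operations."""
    j = intrinsic_degree(m, p)
    return min(n ** 3, j * j * n + j ** 3)


J = intrinsic_degree(m=20, p=2)  # 231 basis functions
cost = trbf_krr_cost(n=464_810, m=20, p=2)
```

With only 231 basis functions the intrinsic-space term is the smaller of the two by many orders of magnitude, so the cost grows linearly rather than cubically in $N$.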


4. EXPERIMENTAL RESULT 

In this section, we use LibLinear and DCSVM, respectively, as the local classifiers $f_1, \ldots, f_h$ in our system.

Table 2. Decomposition settings. Ns and Np stand for the number of subspaces and the number of features in one subspace, respectively.

Dataset   Proposed Method          Setting   RD       PCA   DCA   BCD   ABD
news20    DC-LibLinear-TRBF2KRR    Ns        2        0     0     0     10
                                   Np        677596   0     0     0     135519
RCV1      DC-LibLinear-TRBF3KRR    Ns        4        0     0     0     4
                                   Np        23618    0     0     0     23618
covtype   DC-DCSVM-TRBF2KRR        Ns        4        4     4     4     4
                                   Np        40       40    40    27    27
census    DC-DCSVM-LibLinear       Ns        2        2     0     0     0
                                   Np        300      300   0     0     0


Table 3. Dataset statistics. “#” represents “number of”. A random 0.9/0.1 split is applied to the news20 dataset. A random 0.8/0.2 split is applied to the covtype and census datasets.

Dataset   # Training Instances   # Testing Instances   # Features
news20    17,997                 1,999                 1,355,191
RCV1      20,242                 677,399               47,236
covtype   464,810                116,202               54
census    159,619                39,904                409


Table 4. Comparison of linear classification on real-world datasets.

              news20                       RCV1
Method        Time (s)   Error Rate (%)    Time (s)   Error Rate (%)
Proposed      32         2.87              3.1        3.06
LibLinear     2          3.26              0.3        3.84
SVMlight      467¹       2.74¹             15.5       3.42
BSVM          437¹       2.73¹             13.2       3.68
L2-SVM-MFN    98¹        2.86¹             0.5        3.53


Table 5. Comparison of nonlinear classification on real-world datasets.

                 covtype (c = 32, γ = 32)     census (c = 512, γ = 2^(-…))
Method           Time (s)   Error Rate (%)    Time (s)   Error Rate (%)
Proposed         7537       3.56              1459       5.0
DCSVM (early)²   672        3.88              261        5.1
DCSVM²           11414      3.85              1051       5.8
LibSVM²          83631      3.85              2920       5.8
LaSVM²           102603     5.61              3514       6.8
CascadeSVM²      5600       10.49             849        7.0
LLSVM²           4451       15.79             1212       7.2
FastFood²        8550       19.9              851        8.4
SpSVM²           15113      16.63             3121       9.6
LTPU²            11532      16.75             1695       8.0


We tested the results on large-scale datasets, in which either $M$ or $N$ is large. Our methods are denoted as “DC-classifier1-classifier2”, where “classifier1” indicates the classifier used for $f_1, \ldots, f_m$, and “classifier2” is the one used for $f_{n+1}$. All the experiments are conducted on a machine with an Intel Core i7 2.1 GHz CPU and 8 GB of RAM. The datasets tested in this paper are shown in Table 3 and can be downloaded from http://WWW.csie.ntu.edu.tw/~cjlin/libsvmtoo or the UCI Machine Learning Repository.

Feature-Space Decomposition Setting: Table 2 shows the decomposition settings in our experiments. For data matrices with high feature dimensions, e.g., news20 and RCV1, we use only RD and ABD, which have relatively low computational complexity. For data matrices with low feature dimensions, all the transformation methods can be combined to achieve a lower error rate. We use LibLinear as the global classifier when dealing with the census dataset, as its proportion of positive to negative instances is 0.06/0.94, which can cause bias when TRBFKRR is applied.

Linear Classification: Linear classification is conducted on the news20 and RCV1 datasets. LibLinear is used as the local classifiers $f_1, f_2, \ldots, f_m$, and TRBFKRR is used as the global classifier $f_{n+1}$ in our system. We compare our results with four common fast linear SVM solvers, namely LibLinear, SVMlight, BSVM, and L2-SVM-MFN. As shown in Table 4, our method has an advantage in either training time or error rate.

Non-Linear Classification: DCSVM is used as the local classifiers for nonlinear classification. Interestingly, in DC-DCSVM-TRBFKRR, the divide-and-conquer process is conducted on both the instance dimension and the feature dimension. We evaluate it on the covtype dataset and compare it with the results of the other SVM methods reported by Hsieh et al., as shown in Table 5. The proposed method achieves the lowest error rate with relatively low time complexity on both the covtype and census datasets.

Moreover, compared with directly training TRBFKRR on the whole data matrix, DC-TRBFKRR-TRBFKRR greatly reduces the training complexity from $\min\{N^3, J^2 N + J^3\}$ to $\min\{N^3, \ldots\}$, which enables TRBFKRR training on data matrices with large $N$ and $M$.
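The relative error-rate reductions quoted for RCV1 and covtype can be re-derived from Tables 4 and 5. The short check below assumes that each baseline is the lowest error rate among the competing solvers in the corresponding table (SVMlight for RCV1, DCSVM and LibSVM for covtype); that choice of baseline is an inference from the numbers, not something stated in the text.

```python
def relative_reduction(baseline, proposed):
    """Relative error-rate reduction, in percent."""
    return 100.0 * (baseline - proposed) / baseline


rcv1 = relative_reduction(baseline=3.42, proposed=3.06)     # Table 4
covtype = relative_reduction(baseline=3.85, proposed=3.56)  # Table 5
print(f"RCV1: {rcv1:.2f}%  covtype: {covtype:.2f}%")
# prints "RCV1: 10.53%  covtype: 7.53%"
```

Under that assumption, the 10.53% figure corresponds to RCV1 and the 7.53% figure to covtype.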

5. CONCLUSION 

This paper presents a feature-space decomposition classification method that includes five sub-methods. The experimental results show that our divide-and-conquer classification scheme can reduce error rates (e.g., by 10.53% and 7.53% on the RCV1 and covtype datasets, respectively) compared with training directly on the whole datasets, and can outperform state-of-the-art fast SVM solvers by reducing the overfitting problem. Future work will focus on providing a theoretical analysis of feature-space decomposition and its effects on divide-and-conquer classification.


¹ Results are cited from Keerthi et al.
² Results are cited from Hsieh et al.